Method and system for low-overhead data transfer

ABSTRACT

A method for low-overhead data transfer. The method includes initiating, by a first application, a TCP connection with a second application, establishing, in response to the initiation, the TCP connection between the first application and the second application, and providing, by the first application, pre-post buffer information to the second application, where the pre-post buffer information corresponds to a location in a physical memory of the first computer and where the location in physical memory corresponds to a virtual memory address of the first application. The method further includes transferring data, by the second application, to the first application using the pre-post buffer information, where transferring the data comprises writing the data directly into the location in the physical memory of the first computer.

The present application contains subject matter that may be related to the subject matter in the following U.S. application filed on Dec. 10, 2007, and assigned to the assignee of the present application: “Method and System for Enforcing Resource Constraints for Virtual Machines across Migration” with U.S. application Ser. No. 11/953,839.

BACKGROUND

Conventionally, in the computer-related arts, a network is an arrangement of physical computer systems configured to communicate with each other. In some cases, the physical computer systems include virtual machines, which may also be configured to interact with the network (i.e., communicate with other physical computers and/or virtual machines in the network). Many different types of networks exist, and a network may be classified based on various aspects of the network, such as scale, connection method, functional relationship of computer systems in the network, and/or network topology.

Regarding connection methods, a network may be broadly categorized as wired (using a tangible connection medium such as Ethernet cables) or wireless (using an intangible connection medium such as radio waves). Different connection methods may also be combined in a single network. For example, a wired network may be extended to allow devices to connect to the network wirelessly. However, core network components such as routers, switches, and servers are generally connected using physical wires. Ethernet is defined within the Institute of Electrical and Electronics Engineers (IEEE) 802.3 standards, which are supervised by the IEEE 802.3 Working Group.

To create a wired network, computer systems must be physically connected to each other. That is, the ends of physical wires (for example, Ethernet cables) must be physically connected to network interface cards in the computer systems forming the network. To reconfigure the network (for example, to replace a server or change the network topology), one or more of the physical wires must be disconnected from a computer system and connected to a different computer system.

Further, when transferring data between computer systems in a network, one or more network protocols are typically used to help ensure the data are transferred successfully. For example, network protocols may use checksums, small data packets, acknowledgments, and other data integrity features to help avoid data loss or corruption during the data transfer. The amount of data integrity features required in the network protocol(s) generally depends on the type of data being transferred and the quality of the connection(s) between the computer systems.

SUMMARY

In general, in one aspect, the invention relates to a method for low-overhead data transfer. The method includes initiating, by a first application, a Transmission Control Protocol (TCP) connection with a second application, wherein the first application is executing on a first computer in a first virtual machine, the second application is executing on a second computer in a second virtual machine, and the first computer and the second computer are located on a chassis and communicate over a chassis interconnect, establishing, in response to the initiation, the TCP connection between the first application and the second application, determining that the first computer and second computer are located on the chassis, providing, by the first application, pre-post buffer information to the second application, wherein the pre-post buffer information corresponds to a location in a physical memory of the first computer and wherein the location in physical memory corresponds to a virtual memory address of the first application, and transferring data, by the second application, to the first application using the pre-post buffer information, wherein transferring the data comprises writing the data directly into the location in the physical memory of the first computer.

In general, in one aspect, the invention relates to a system. The system includes a chassis interconnect, a first application executing on a first computer in a first virtual machine, and a second application executing on a second computer in a second virtual machine, wherein the first computer and the second computer are located on a chassis and communicate over the chassis interconnect, wherein the first application is configured to initiate a Transmission Control Protocol (TCP) connection with the second application, wherein, in response to the initiation, the TCP connection is established between the first application and the second application, wherein the first application is configured to provide pre-post buffer information to the second application after the first application is determined to be executing on the same chassis as the second application, wherein the pre-post buffer information corresponds to a location in a physical memory of the first computer and wherein the location in physical memory corresponds to a virtual memory address of the first application, and wherein the second application transfers data to the first application using the pre-post buffer information, wherein transferring the data comprises writing the data directly into the location in the physical memory of the first computer.

In general, in one aspect, the invention relates to a computer readable medium comprising a plurality of executable instructions for low-overhead data transfer, wherein the plurality of executable instructions comprises instructions to initiate, by a first application, a Transmission Control Protocol (TCP) connection with a second application, wherein the first application is executing on a first computer in a first virtual machine, the second application is executing on a second computer in a second virtual machine, and the first computer and the second computer are located on a chassis and communicate over a chassis interconnect, establish, in response to the initiation, the TCP connection between the first application and the second application, determine that the first computer and second computer are located on the chassis, provide, by the first application, pre-post buffer information to the second application, wherein the pre-post buffer information corresponds to a location in a physical memory of the first computer and wherein the location in physical memory corresponds to a virtual memory address of the first application, and transfer data, by the second application, to the first application using the pre-post buffer information, wherein transferring the data comprises writing the data directly into the location in the physical memory of the first computer.

Other aspects of the invention will be apparent from the following description and the appended claims.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 shows a diagram of a blade chassis in accordance with one or more embodiments of the invention.

FIG. 2 shows a diagram of a blade in accordance with one or more embodiments of the invention.

FIG. 3 shows a diagram of a network express manager in accordance with one or more embodiments of the invention.

FIG. 4 shows a diagram of a virtual switch in accordance with one or more embodiments of the invention.

FIG. 5 shows a flowchart of a method for creating a virtual network path in accordance with one or more embodiments of the invention.

FIGS. 6A-6C show an example of creating virtual network paths in accordance with one or more embodiments of the invention.

FIGS. 7-8 show flowcharts of a method for low-overhead data transfer in accordance with one or more embodiments of the invention.

FIG. 9 shows an example of low-overhead data transfer in accordance with one or more embodiments of the invention.

DETAILED DESCRIPTION

Specific embodiments of the invention will now be described in detail with reference to the accompanying figures. Like elements in the various figures are denoted by like reference numerals for consistency.

In the following detailed description of embodiments of the invention, numerous specific details are set forth in order to provide a more thorough understanding of the invention. However, it will be apparent to one of ordinary skill in the art that the invention may be practiced without these specific details.

In other instances, well-known features have not been described in detail to avoid unnecessarily complicating the description.

In general, embodiments of the invention provide a method and system for low-overhead data transfer. More specifically, embodiments of the invention provide a method and system for enabling two applications executing on blades within a common blade chassis to communicate using low-overhead data transfer.

Further, embodiments of the invention provide a method and system to enable two applications to participate in a zero-copy handshake and then proceed to communicate using low-overhead data transfer.

In one or more embodiments of the invention, virtual network interface cards (VNICs) associated with the applications are connected to each other via a chassis interconnect. Specifically, the VNICs may be nodes of a virtual network path that includes a “virtual wire” used to transmit network traffic via the chassis interconnect. The concept of a virtual wire is discussed in detail below.

FIG. 1 shows a diagram of a blade chassis (100) in accordance with one or more embodiments of the invention. The blade chassis (100) includes multiple blades (e.g., blade A (102), blade B (104)) communicatively coupled with a chassis interconnect (106). For example, the blade chassis (100) may be a Sun Blade 6048 Chassis by Sun Microsystems Inc., an IBM BladeCenter® chassis, an HP BladeSystem enclosure by Hewlett Packard Inc., or any other type of blade chassis. The blades may be of any type(s) compatible with the blade chassis (100). BladeCenter® is a registered trademark of International Business Machines, Inc. (IBM), headquartered in Armonk, N.Y.

In one or more embodiments of the invention, the blades are configured to communicate with each other via the chassis interconnect (106). Thus, the blade chassis (100) allows for communication between the blades without requiring traditional network wires (such as Ethernet cables) between the blades. For example, depending on the type of blade chassis (100), the chassis interconnect (106) may be a Peripheral Component Interconnect Express (PCI-E) backplane, and the blades may be configured to communicate with each other via PCI-E endpoints. Those skilled in the art will appreciate that other connection technologies may be used to connect the blades to the blade chassis.

Continuing with the discussion of FIG. 1, to communicate with clients outside the blade chassis (100), the blades are configured to share a physical network interface (110). The physical network interface (110) includes one or more network ports (for example, Ethernet ports), and provides an interface between the blade chassis (100) and the network (i.e., interconnected computer systems external to the blade chassis (100)) to which the blade chassis (100) is connected. The blade chassis (100) may be connected to multiple networks, for example using multiple network ports.

In one or more embodiments, the physical network interface (110) is managed by a network express manager (108). Specifically, the network express manager (108) is configured to manage access by the blades to the physical network interface (110). The network express manager (108) may also be configured to manage internal communications between the blades themselves, in a manner discussed in detail below. The network express manager (108) may be any combination of hardware, software, and/or firmware including executable logic for managing network traffic.

FIG. 2 shows a diagram of a blade (200) in accordance with one or more embodiments of the invention. “Blade” is a term of art referring to a computer system located within a blade chassis (for example, the blade chassis (100) of FIG. 1). Blades typically include fewer components than stand-alone computer systems or conventional servers. In one or more embodiments of the invention, fully featured stand-alone computer systems or conventional servers may also be used instead of (or in combination with) the blades. Generally, blades in a blade chassis each include one or more processors and associated memory (e.g., RAM, ROM, etc.). Blades may also include storage devices (for example, hard drives and/or optical drives) and numerous other elements and functionalities typical of today's computer systems (not shown), such as a keyboard, a mouse, and/or output means such as a monitor. One or more of the aforementioned components may be shared by multiple blades located in the blade chassis. For example, multiple blades may share a single output device.

Continuing with discussion of FIG. 2, the blade (200) includes a host operating system (not shown) configured to execute one or more virtual machines (e.g., virtual machine C (202), virtual machine D (204)). Broadly speaking, the virtual machines are distinct operating environments configured to inherit underlying functionality of the host operating system via an abstraction layer. In one or more embodiments of the invention, each virtual machine includes a separate instance of an operating system (e.g., operating system instance C (206), operating system instance D (208)). For example, the Xen® virtualization project allows for multiple guest operating systems executing in a host operating system. Xen® is a trademark overseen by the Xen Project Advisory Board. In one or more embodiments of the invention, the host operating system supports virtual execution environments (not shown). An example of a virtual execution environment is a Solaris™ Container. In such cases, the Solaris™ Container may execute in the host operating system, which may be a Solaris™ operating system. Solaris™ is a trademark of Sun Microsystems, Inc. In one or more embodiments of the invention, the host operating system may include both virtual machines and virtual execution environments.

Many different types of virtual machines and virtual execution environments exist. Further, the virtual machines may include many different types of functionality, such as a switch, a router, a firewall, a load balancer, an application server, any other type of network-enabled service, or any combination thereof.

In one or more embodiments of the invention, the virtual machines and/or virtual execution environments inherit network connectivity from the host operating system via VNICs (e.g., VNIC C (210), VNIC D (212)). To the virtual machines and the virtual execution environments, the VNICs appear as physical NICs. In one or more embodiments of the invention, the use of VNICs allows an arbitrary number of virtual machines and/or virtual execution environments to share the blade's (200) networking functionality. Further, in one or more embodiments of the invention, each virtual machine and/or virtual execution environment may be associated with an arbitrary number of VNICs, thereby providing increased flexibility in the types of networking functionality available to the virtual machines and/or virtual execution environments. For example, a virtual machine may use one VNIC for incoming network traffic, and another VNIC for outgoing network traffic.

VNICs in accordance with one or more embodiments of the invention are described in detail in commonly owned U.S. patent application Ser. No. 11/489,942, entitled “Multiple Virtual Network Stack Instances using Virtual Network Interface Cards,” in the names of Nicolas G. Droux, Erik Nordmark, and Sunay Tripathi, the contents of which are hereby incorporated by reference in their entirety. VNICs in accordance with one or more embodiments of the invention also are described in detail in commonly owned U.S. patent application Ser. No. 11/480,000, entitled “Method and System for Controlling Virtual Machine Bandwidth,” in the names of Sunay Tripathi, Tim P. Marsland, and Nicolas G. Droux, the contents of which are hereby incorporated by reference in their entirety.

As discussed above, each blade's networking functionality (and, by extension, networking functionality inherited by the VNICs) includes access to a shared physical network interface and communication with other blades via the chassis interconnect. FIG. 3 shows a diagram of a network express manager (300) in accordance with one or more embodiments of the invention. The network express manager (300) is configured to route network traffic traveling to and from VNICs located in the blades. Specifically, the network express manager (300) includes a virtual switching table (302), which includes a mapping of VNIC identifiers (304) to VNIC locations (306) in the chassis interconnect. In one or more embodiments, the VNIC identifiers (304) are Internet Protocol (IP) addresses, and the VNIC locations (306) are PCI-E endpoints associated with the blades (e.g., if the chassis interconnect is a PCI-E backplane). Alternatively, another routing scheme may be used.

In one or more embodiments, the network express manager (300) is configured to receive network traffic via the physical network interface and route the network traffic to the appropriate location (i.e., where the VNIC is located) using the virtual switching table (302). Further, the network express manager (300) may be configured to route network traffic between different VNICs located in the blade chassis. In one or more embodiments of the invention, using the virtual switching table (302) in this manner facilitates the creation of a virtual network path, which includes virtual wires. Thus, using the virtual switching table (302), virtual machines located in different blades may be interconnected to form an arbitrary virtual network topology, where the VNICs associated with each virtual machine do not need to know the physical locations of other VNICs. Further, if a virtual machine is migrated from one blade to another, the virtual network topology may be preserved by updating the virtual switching table (302) to reflect the corresponding VNIC's new physical location (for example, a different PCI-E endpoint).
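The mapping and migration behavior described above can be illustrated with a minimal Python sketch. The class name, method names, IP address, and endpoint labels below are hypothetical and chosen only for illustration; the description does not specify a data structure for the virtual switching table.

```python
from typing import Dict, Optional


class VirtualSwitchingTable:
    """Illustrative mapping of VNIC identifiers to VNIC locations.

    Assumption: identifiers are IP address strings and locations are
    opaque labels standing in for PCI-E endpoints.
    """

    def __init__(self) -> None:
        self._locations: Dict[str, str] = {}

    def populate(self, vnic_id: str, location: str) -> None:
        # Associate a VNIC identifier with its current location.
        self._locations[vnic_id] = location

    def lookup(self, vnic_id: str) -> Optional[str]:
        # Return the location, or None if the VNIC is unknown.
        return self._locations.get(vnic_id)

    def migrate(self, vnic_id: str, new_location: str) -> None:
        # After a virtual machine migrates, only the table entry changes;
        # peers keep addressing the VNIC by the same identifier.
        if vnic_id not in self._locations:
            raise KeyError(vnic_id)
        self._locations[vnic_id] = new_location


table = VirtualSwitchingTable()
table.populate("10.0.0.5", "pcie-endpoint-2")
table.migrate("10.0.0.5", "pcie-endpoint-7")
print(table.lookup("10.0.0.5"))
```

In this sketch, senders never see the endpoint label, which mirrors the statement that VNICs do not need to know the physical locations of other VNICs.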

In some cases, network traffic from one VNIC may be destined for a VNIC located in the same blade, but associated with a different virtual machine. In one or more embodiments of the invention, a virtual switch may be used to route the network traffic between the VNICs independent of the blade chassis. Virtual switches in accordance with one or more embodiments of the invention are discussed in detail in commonly owned U.S. patent application Ser. No. 11/480,261, entitled “Virtual Switch,” in the names of Nicolas G. Droux, Sunay Tripathi, and Erik Nordmark, the contents of which are hereby incorporated by reference in their entirety.

For example, FIG. 4 shows a diagram of a virtual switch (400) in accordance with one or more embodiments of the invention. The virtual switch (400) provides connectivity between VNIC X (406) associated with virtual machine X (402) and VNIC Y (408) associated with virtual machine Y (404). In one or more embodiments, the virtual switch (400) is managed by a host operating system (410) within which virtual machine X (402) and virtual machine Y (404) are located. Specifically, the host operating system (410) may be configured to identify network traffic targeted at a VNIC in the same blade, and route the traffic to the VNIC using the virtual switch (400). In one or more embodiments of the invention, the virtual switch (400) may reduce utilization of the blade chassis and the network express manager by avoiding unnecessary round-trip network traffic.
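The host operating system's choice between the virtual switch and the chassis interconnect can be sketched as follows. This is a simplified illustration in Python; the function name, the string labels, and the use of a set of local VNIC names are assumptions, not details taken from the description.

```python
from typing import Set


def choose_route(dest_vnic: str, local_vnics: Set[str]) -> str:
    """Pick a delivery mechanism for traffic leaving a VNIC on this blade.

    Assumption: the host operating system knows which VNICs it hosts.
    """
    if dest_vnic in local_vnics:
        # Destination is on the same blade: stay inside the host.
        return "virtual_switch"
    # Destination is elsewhere: hand off to the chassis interconnect.
    return "chassis_interconnect"


local = {"VNIC X", "VNIC Y"}
print(choose_route("VNIC Y", local))
print(choose_route("VNIC N", local))
```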

FIG. 5 shows a flowchart of a method for creating a virtual network path in accordance with one or more embodiments of the invention. In one or more embodiments of the invention, one or more of the steps shown in FIG. 5 may be omitted, repeated, and/or performed in a different order. Accordingly, embodiments of the invention should not be considered limited to the specific arrangement of steps shown in FIG. 5.

In one or more embodiments of the invention, in Step 502, VNICs are instantiated for multiple virtual machines. The virtual machines are located in blades, as discussed above. Further, the virtual machines may each be associated with one or more VNICs. In one or more embodiments of the invention, instantiating a VNIC involves loading a VNIC object in memory and registering the VNIC object with a host operating system, i.e., an operating system that is hosting the virtual machine associated with the VNIC. Registering the VNIC object establishes an interface between the host operating system's networking functionality and the abstraction layer provided by the VNIC. Thereafter, when the host operating system receives network traffic addressed to the VNIC, the host operating system forwards the network traffic to the VNIC. Instantiation of VNICs in accordance with one or more embodiments of the invention is discussed in detail in U.S. patent application Ser. No. 11/489,942, incorporated by reference above.

As discussed above, a single blade may include multiple virtual machines configured to communicate with each other. In one or more embodiments of the invention, in Step 504, a virtual switch is instantiated to facilitate communication between the virtual machines. As noted above, the virtual switch allows communication between VNICs independent of the chassis interconnect. Instantiation of virtual switches in accordance with one or more embodiments of the invention is discussed in detail in U.S. patent application Ser. No. 11/480,261, incorporated by reference above.

In one or more embodiments of the invention, in Step 506, a virtual switching table is populated. As noted above, the virtual switching table may be located in a network express manager configured to manage network traffic flowing to and from the virtual machines. Populating the virtual switching table involves associating VNIC identifiers (for example, Internet Protocol and/or Media Access Control (MAC) addresses) with VNIC locations (for example, PCI-E endpoints). In one or more embodiments of the invention, the virtual switching table is populated in response to a user command issued via a control operating system, i.e., an operating system that includes functionality to control the network express manager.

In one or more embodiments of the invention, VNICs include settings for controlling the processing of network packets. In one or more embodiments of the invention, in Step 508, settings are assigned to the VNICs according to a networking policy. Many different types of networking policies may be enforced using settings in the VNICs. For example, a setting may be used to provision a particular portion of a blade's available bandwidth to one or more VNICs. As another example, a setting may be used to restrict use of a VNIC to a particular type of network traffic, such as Voice over IP (VoIP) or Transmission Control Protocol/IP (TCP/IP). Further, settings for multiple VNICs in a virtual network path may be identical. For example, VNICs in a virtual network path may be capped at the same bandwidth limit, thereby allowing for consistent data flow across the virtual network path. In one or more embodiments of the invention, a network express manager is configured to transmit the desired settings to the VNICs.

In one or more embodiments of the invention, once the VNICs are instantiated and the virtual switching table is populated, network traffic may be transmitted from a VNIC in one blade to a VNIC in another blade. The connection between the two VNICs may be thought of as a “virtual wire,” because the arrangement obviates the need for traditional network wires such as Ethernet cables. A virtual wire functions similarly to a physical wire in the sense that network traffic passing through one virtual wire is isolated from network traffic passing through another virtual wire, even though the network traffic may pass through the same blade (i.e., using the same virtual machine or different virtual machines located in the blade).

Further, a combination of two or more virtual wires may be thought of as a “virtual network path.” Specifically, transmitting network traffic over the virtual network path involves routing the network traffic through a first virtual wire (Step 510) and then through a second virtual wire (Step 512). For example, when receiving network traffic from a client via the physical network interface, one virtual wire may be located between the physical network interface and a VNIC, and a second virtual wire may be located between the VNIC and another VNIC.

FIGS. 6A-6C show an example of creating virtual network paths in accordance with one or more embodiments of the invention. Specifically, FIG. 6A shows a diagram of an actual topology (600) in accordance with one or more embodiments of the invention, FIG. 6B shows how network traffic may be routed through the actual topology (600), and FIG. 6C shows a virtual network topology (640) created by routing network traffic as shown in FIG. 6B. FIGS. 6A-6C are provided as examples only, and should not be construed as limiting the scope of the invention.

Referring first to FIG. 6A, the actual topology (600) includes multiple virtual machines. Specifically, the actual topology (600) includes a router (602), a firewall (604), application server M (606), and application server N (608), each executing in a separate virtual machine. The virtual machines are located in blades communicatively coupled with a chassis interconnect (622), and include networking functionality provided by the blades via VNICs (i.e., VNIC H (610), VNIC J (612), VNIC K (614), VNIC L (616), VNIC M (618), and VNIC N (620)). For ease of illustration, the blades themselves are not included in the diagram.

In one or more embodiments of the invention, the router (602), the firewall (604), application server M (606), and application server N (608) are each located in separate blades. Alternatively, as noted above, a blade may include multiple virtual machines. For example, the router (602) and the firewall (604) may be located in a single blade. Further, each virtual machine may be associated with a different number of VNICs than the number of VNICs shown in FIG. 6A.

Continuing with discussion of FIG. 6A, a network express manager (624) is configured to manage network traffic flowing to and from the virtual machines. Further, the network express manager (624) is configured to manage access to a physical network interface (626) used to communicate with client O (628) and client P (630). In FIG. 6A, the virtual machines, VNICs, chassis interconnect (622), network express manager (624), and physical network interface (626) are all located within a blade chassis. Client O (628) and client P (630) are located in one or more networks (not shown) to which the blade chassis is connected.

FIG. 6B shows how network traffic may be routed through the actual topology (600) in accordance with one or more embodiments of the invention. In one or more embodiments of the invention, the routing is performed by the network express manager (624) using a virtual switching table (634).

As discussed above, network traffic routed to and from the VNICs may be thought of as flowing through a “virtual wire.” For example, FIG. 6B shows a virtual wire (632) located between application server M (606) and application server N (608). To use the virtual wire, application server M (606) transmits a network packet via VNIC M (618). The network packet is addressed to VNIC N (620) associated with application server N (608). The network express manager (624) receives the network packet via the chassis interconnect (622), inspects the network packet, and determines the target VNIC location using the virtual switching table (634). If the target VNIC location is not found in the virtual switching table (634), then the network packet may be dropped. In this example, the target VNIC location is the blade in which VNIC N (620) is located. The network express manager (624) routes the network packet to the target VNIC location, and application server N (608) receives the network packet via VNIC N (620), thereby completing the virtual wire (632). In one or more embodiments of the invention, the virtual wire (632) may also be used to transmit network traffic in the opposite direction, i.e., from application server N (608) to application server M (606).
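The lookup-then-route-or-drop behavior described for the virtual wire can be sketched in Python as follows. The `Packet` type, the `route` function, the blade labels, and the in-memory delivery list are hypothetical stand-ins; an actual network express manager may be implemented in hardware, software, and/or firmware.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Packet:
    dest_vnic: str   # identifier of the target VNIC
    payload: bytes


def route(packet: Packet,
          switching_table: Dict[str, str],
          delivered: List[Tuple[str, Packet]]) -> bool:
    """Route one packet; return False if it is dropped."""
    location = switching_table.get(packet.dest_vnic)
    if location is None:
        # Target VNIC location not found in the table: drop the packet.
        return False
    # Forward to the blade (location) where the target VNIC resides.
    delivered.append((location, packet))
    return True


switching_table = {"VNIC M": "blade-1", "VNIC N": "blade-2"}
delivered: List[Tuple[str, Packet]] = []
ok = route(Packet("VNIC N", b"hello"), switching_table, delivered)
dropped = not route(Packet("VNIC Z", b"lost"), switching_table, delivered)
print(ok, dropped, delivered[0][0])
```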

Further, as discussed above, multiple virtual wires may be combined to form a “virtual network path.” For example, FIG. 6B shows virtual network path R (636), which flows from client O (628), through the router (602), through the firewall (604), and terminates at application server M (606). Specifically, the virtual network path R (636) includes the following virtual wires. A virtual wire is located between the physical network interface (626) and VNIC H (610). Another virtual wire is located between VNIC J (612) and VNIC K (614). Yet another virtual wire is located between VNIC L (616) and VNIC M (618). If the router (602) and the firewall (604) are located in the same blade, then a virtual switch may be substituted for the virtual wire located between VNIC J (612) and VNIC K (614), thereby eliminating use of the chassis interconnect (622) from communications between the router (602) and the firewall (604).

Similarly, FIG. 6B shows virtual network path S (638), which flows from client P (630), through the router (602), and terminates at application server N (608). Virtual network path S (638) includes a virtual wire between the physical network interface (626) and VNIC H (610), and a virtual wire between VNIC J (612) and VNIC N (620). The differences between virtual network path R (636) and virtual network path S (638) exemplify how multiple virtual network paths may be located in the same blade chassis.

In one or more embodiments of the invention, VNIC settings are applied separately for each virtual network path. For example, different bandwidth limits may be used for virtual network path R (636) and virtual network path S (638).

Thus, the virtual network paths may be thought of as including many of the same features as traditional network paths (e.g., using Ethernet cables), even though traditional network wires are not used within the blade chassis. However, traditional network wires may still be required outside the blade chassis, for example between the physical network interface (626) and client O (628) and/or client P (630).

FIG. 6C shows a diagram of the virtual network topology (640) resulting from the use of the virtual network path R (636), virtual network path S (638), and virtual wire (632) shown in FIG. 6B. The virtual network topology (640) allows the various components of the network (i.e., router (602), firewall (604), application server M (606), application server N (608), client O (628), and client P (630)) to interact in a manner similar to a traditional wired network. However, as discussed above, communication between the components located within the blade chassis (i.e., router (602), firewall (604), application server M (606), and application server N (608)) is accomplished without the use of traditional network wires.
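As an illustration, virtual network paths R and S can be modeled as ordered lists of virtual wires, each wire being a pair of endpoints. The Python sketch below uses the endpoint names from FIG. 6B; the representation itself and the helper functions are assumptions made for illustration.

```python
from typing import List, Tuple

Wire = Tuple[str, str]  # (source endpoint, destination endpoint)

# Virtual wires of path R and path S, as listed in the description.
PATH_R: List[Wire] = [
    ("physical network interface", "VNIC H"),
    ("VNIC J", "VNIC K"),
    ("VNIC L", "VNIC M"),
]
PATH_S: List[Wire] = [
    ("physical network interface", "VNIC H"),
    ("VNIC J", "VNIC N"),
]


def shared_wires(a: List[Wire], b: List[Wire]) -> List[Wire]:
    # Wires that appear in both paths, in the order they occur in `a`.
    return [wire for wire in a if wire in b]


def terminus(path: List[Wire]) -> str:
    # The endpoint at which the path terminates.
    return path[-1][1]


print(shared_wires(PATH_R, PATH_S))
print(terminus(PATH_R), terminus(PATH_S))
```

Both paths enter through the same wire to VNIC H and diverge after the router, which is one way to picture multiple virtual network paths coexisting in the same blade chassis.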

In one embodiment of the invention, data may be transferred betweenvirtual machines executing on different blades in a blade chassis usingTransmission Control Protocol (TCP) and Internet Protocol (IP). Further,data may also be transferred between the virtual machines usinglow-overhead data transfers. In particular, data may be transferreddirectly from physical memory on one blade to physical memory on anotherblade.

More specifically, the virtual machine (or application executing therein) may establish a TCP connection with another virtual machine and then, using the TCP connection, perform a zero-copy handshake. In one embodiment of the invention, the zero-copy handshake involves determining whether the virtual machines are able to communicate using low-overhead data transfer and whether the virtual machines (or applications executing therein) want to transfer data using low-overhead data transfer. In one embodiment of the invention, the virtual machines may communicate using a combination of data transfer over TCP/IP and data transfer using low-overhead data transfer.

In one embodiment of the invention, low-overhead data transfer is achieved by allowing the direct transfer of data from the virtual memory associated with a sending application (executing in a first virtual machine) to the virtual memory of a receiving application (executing in a second virtual machine), where the sending application is executing on a first blade and the receiving application is executing on a second blade. In one embodiment of the invention, the target virtual memory address for the transfer must be provided prior to the transfer of data. If the receiving application is executing in a guest operating system (executing in a virtual machine), which in turn is executing in a host operating system, then the receiving application must provide the sending application (or a related process) a physical memory address (which corresponds to the virtual memory associated with the receiving application) for a buffer to which to transfer the data. However, the receiving application is only able to provide a virtual memory address for the receiving application. This virtual memory address must be translated one or more times in order to obtain the underlying physical memory address. The process of translation is described in FIG. 7 below. Once the translation is complete, the physical memory address (as well as any other necessary information) is provided to the sending application (or a related process) to perform low-overhead data transfer as described in FIG. 8.

FIG. 7 shows a flowchart of a method for pre-posting buffers for an application prior to the application using low-overhead data transfer. In one or more embodiments of the invention, one or more of the steps shown in FIG. 7 may be omitted, repeated, and/or performed in a different order than the order shown in FIG. 7. Accordingly, embodiments of the invention should not be considered limited to the specific arrangement of steps shown in FIG. 7.

In Step 700, an application specifies a pre-post buffer address. In one embodiment of the invention, the pre-post buffer address is a virtual memory address in virtual memory associated with the application. In one embodiment of the invention, the pre-post buffer address may refer to a buffer that is greater than 1 megabyte in size. In Step 702, the guest operating system receives and translates the pre-post buffer address into a guest OS virtual memory address. In one embodiment of the invention, the guest OS virtual memory address is a virtual memory address in a virtual memory associated with the guest operating system.

In Step 704, the guest operating system provides the guest OS virtual memory address to the host operating system. In Step 706, the host operating system receives and translates the guest OS virtual memory address into a host OS virtual memory address. Based on the host OS virtual memory address, the host operating system may determine the underlying physical memory address corresponding to the host OS virtual memory address. The physical memory address that corresponds to the host OS virtual memory address is the same physical memory address that corresponds to the pre-post buffer address.

In one embodiment of the invention, the host operating system notifies the guest operating system that the pre-post buffer address has been successfully pre-posted. The guest operating system may, in turn, notify the application that the pre-post buffer address has been successfully pre-posted. In addition, the host operating system may maintain the translated physical address and any other related information (collectively referred to as “pre-post buffer information”).
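
By way of a non-limiting illustration, the translation chain of FIG. 7 might be sketched as follows in Python. The 4096-byte page size, the dictionary-based page tables, and the names `translate`, `pre_post`, and `PrePostBufferInfo` are assumptions introduced for this sketch only and are not taken from the figures. For brevity, only the starting address of the buffer is translated, whereas a buffer larger than one page would require a translation per page.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed page size; the description does not specify one


@dataclass(frozen=True)
class PrePostBufferInfo:
    """Hypothetical record the host OS might keep after pre-posting."""
    physical_address: int
    length: int


def translate(page_table: dict[int, int], address: int) -> int:
    """Map an address through one page table, preserving the page offset."""
    page, offset = divmod(address, PAGE_SIZE)
    if page not in page_table:
        raise KeyError(f"address {address:#x} is not mapped")
    return page_table[page] * PAGE_SIZE + offset


def pre_post(app_address: int, length: int,
             app_to_guest: dict[int, int],
             guest_to_host: dict[int, int],
             host_to_physical: dict[int, int]) -> PrePostBufferInfo:
    """Walk application VA -> guest OS VA -> host OS VA -> physical address."""
    guest_va = translate(app_to_guest, app_address)   # Step 702
    host_va = translate(guest_to_host, guest_va)      # Step 706
    physical = translate(host_to_physical, host_va)   # host OS VA -> physical
    return PrePostBufferInfo(physical, length)


# Toy page tables keyed and valued by page number (illustrative values only).
app_to_guest = {0x10: 0x20}
guest_to_host = {0x20: 0x30}
host_to_physical = {0x30: 0x40}

info = pre_post(0x10 * PAGE_SIZE + 0x123, 2 * 1024 * 1024,
                app_to_guest, guest_to_host, host_to_physical)
print(hex(info.physical_address))  # prints 0x40123
```

In this sketch the page offset (0x123) is carried unchanged through every level, which reflects the statement above that the final physical memory address is the one underlying the original pre-post buffer address.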

At this stage, the application may now participate in low-overhead data transfer. More specifically, the application may receive data using low-overhead data transfer. Those skilled in the art will appreciate that the method shown in FIG. 7 may be repeated multiple times for a given application in order for the application to pre-post multiple buffers for use in low-overhead data transfer. Further, the application may also send data to another application using low-overhead data transfer if the other application also pre-posts buffers using, for example, the method shown in FIG. 7.

FIG. 8 shows a flowchart of a method for initiating and using low-overhead data transfer. In one or more embodiments of the invention, one or more of the steps shown in FIG. 8 may be omitted, repeated, and/or performed in a different order than the order shown in FIG. 8. Accordingly, embodiments of the invention should not be considered limited to the specific arrangement of steps shown in FIG. 8.

In Step 800, Application A attempts to initiate a TCP connection with Application B. In one embodiment of the invention, Application A provides an IP address assigned to the virtual machine (or assigned to the VNIC associated with the virtual machine) on which Application B is executing. In addition, Application A may also provide a port number.

In Step 802, the guest OS kernel, in response to the request from Application A to initiate the TCP connection, creates socket A. In one embodiment of the invention, socket A is a kernel level process identified by the IP-Port Number pair and is a communication end-point configured to interface with Application A and the VNIC executing on the host operating system (on which the guest OS is executing). In Step 804, a TCP connection is initiated by socket A. In Step 806, socket B responds to the connection request and a TCP connection is established.

In Step 808, the zero-copy handshake is initiated. In one embodiment of the invention, the zero-copy handshake is an exchange of data designed to establish whether two applications may transfer data using low-overhead data transfer. In one embodiment of the invention, the zero-copy handshake is initiated when Application A sends one or more requests to Application B to determine whether Application A and Application B may transfer data using low-overhead data transfer. In one embodiment of the invention, the request may include placing a specific marker in a TCP SYN packet.

In one embodiment of the invention, instead of the applications initiating the zero-copy handshake, the VNICs executing on the respective host operating systems (see FIG. 9 below) may initiate and subsequently perform the zero-copy handshake. In such cases, one or both of the applications, prior to the initiation of the TCP connection, have indicated that they are able to transfer data using low-overhead data transfer and have performed the method shown in FIG. 7 to obtain the pre-post buffer information.

In Step 810, as part of the zero-copy handshake, a determination is made about whether Application A and Application B are connected over a local TCP connection. In one embodiment of the invention, Application A and Application B are connected over a local TCP connection when both Application A and Application B are executing on blades within the same blade chassis. If Application A and Application B are connected over a local TCP connection, the process proceeds to Step 812. Otherwise, the process proceeds to Step 820. In Step 820, Applications A and B communicate using TCP/IP.

In Step 812, as part of the zero-copy handshake, a determination is made about whether Application B wants to participate in low-overhead data transfer. In one embodiment of the invention, this determination may include either of the following determinations: (i) Application B will send data to Application A using low-overhead data transfer but will only receive data from Application A via TCP/IP; and (ii) Application B will send data to Application A using low-overhead data transfer and Application B will receive data from Application A using low-overhead data transfer. If Application B wants to participate in low-overhead data transfer, then the process proceeds to Step 814. Otherwise, the process proceeds to Step 820 (i.e., Application B does not want to participate in either of the aforementioned scenarios). In one embodiment of the invention, the zero-copy handshake is performed over the TCP connection.

In Step 814, Application B is provided with Application A's pre-post buffer information. In Step 816, depending on the determination in Step 812, Application A may be provided with Application B's pre-post buffer information. In one embodiment of the invention, the information transferred in Step 814 and Step 816 is communicated over the TCP connection. In Step 818, Applications A and B participate in low-overhead data transfer.
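
For illustration only, the decision flow of Steps 810 through 820 might be sketched as follows in Python. The `Mode` enumeration, the function names, and the use of a chassis identifier string to decide whether the connection is local are assumptions introduced for this sketch. The description does not prescribe how locality or Application B's preference is represented.

```python
from enum import Enum


class Mode(Enum):
    TCP_ONLY = "tcp/ip only"                       # Step 820
    B_SENDS_LOW_OVERHEAD = "B sends low-overhead"  # scenario (i) of Step 812
    BOTH_LOW_OVERHEAD = "both low-overhead"        # scenario (ii) of Step 812


def zero_copy_handshake(chassis_a: str, chassis_b: str,
                        b_will_send: bool, b_will_receive: bool) -> Mode:
    """Return the transfer mode chosen by the handshake."""
    # Step 810: the connection is local only when both blades share a chassis.
    if chassis_a != chassis_b:
        return Mode.TCP_ONLY
    # Step 812: both listed scenarios require B to send via low-overhead transfer.
    if not b_will_send:
        return Mode.TCP_ONLY
    if b_will_receive:
        return Mode.BOTH_LOW_OVERHEAD
    return Mode.B_SENDS_LOW_OVERHEAD


def pre_post_info_exchanged(mode: Mode) -> dict[str, bool]:
    """Report which pre-post buffer information is handed over."""
    return {
        "a_info_to_b": mode is not Mode.TCP_ONLY,       # Step 814
        "b_info_to_a": mode is Mode.BOTH_LOW_OVERHEAD,  # Step 816
    }


mode = zero_copy_handshake("chassis-1", "chassis-1",
                           b_will_send=True, b_will_receive=False)
print(mode.name, pre_post_info_exchanged(mode))
```

In scenario (i), only Application A's pre-post buffer information is passed to Application B, because Application B continues to receive data over TCP/IP and therefore has no need to expose a buffer.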

In one or more embodiments of the invention, low-overhead data transfer from Application A to Application B uses, for example, a Direct Memory Access (DMA) operation, where the DMA operation uses as input Application B's pre-post buffer information. Those skilled in the art will appreciate that other write operations (e.g., RDMA) may be used to write data directly from one physical memory location to another physical memory location on a different blade.
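
As a purely illustrative model, the effect of such a write may be simulated in Python by treating the receiving blade's physical memory as a `bytearray` and the pre-post buffer information as an address and a length. The function `low_overhead_write` and the 64-byte memory are hypothetical. An actual DMA or RDMA operation is performed by hardware or by a host-managed engine and is not reproduced here.

```python
def low_overhead_write(target_memory: bytearray, address: int,
                       length: int, payload: bytes) -> int:
    """Place payload into the pre-posted region of target_memory in place.

    address and length stand in for the receiver's pre-post buffer information.
    """
    if len(payload) > length:
        raise ValueError("payload does not fit in the pre-posted buffer")
    if address < 0 or address + len(payload) > len(target_memory):
        raise ValueError("pre-posted buffer lies outside physical memory")
    # A memoryview writes into the existing storage without creating
    # an intermediate copy of the destination region.
    view = memoryview(target_memory)
    view[address:address + len(payload)] = payload
    return len(payload)


# Simulated physical memory of the receiving blade (illustrative size).
physical_memory_b = bytearray(64)

# Assumed pre-post buffer information for Application B: address 16, length 32.
written = low_overhead_write(physical_memory_b, 16, 32, b"hello")
print(written, bytes(physical_memory_b[16:21]))  # prints 5 b'hello'
```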

In one embodiment of the invention, the low-overhead transfer is performed by DMA (or RDMA) engines executing in (or managed by) the respective host operating systems. Further, because the data transfer is directly from one blade to another, the data transfer does not require the additional processing overhead associated with other transfer protocols such as TCP. Further, in one embodiment of the invention, the low-overhead data transfer may use the underlying error detection and correction functionality of the chassis interconnect to ensure that data is transferred in an uncorrupted manner.

In one embodiment of the invention, once data from Application B is transferred to Application A using the low-overhead data transfer, Application A is notified of the presence of the data. In one embodiment of the invention, Application A receives the notification from the guest operating system on which it is executing. Further, the guest operating system is notified by the host operating system on which it is executing. Finally, the host operating system is notified by Application B, the guest operating system on which Application B is executing, or the host operating system on which the aforementioned guest operating system is executing (or a process executing thereon).

In one embodiment of the invention, Application A and Application B may communicate using both TCP/IP and low-overhead data transfer. For example, TCP/IP may be used for all communication of a certain type (e.g., all files in a specific file format) and/or less than a certain size and low-overhead data transfer may be used for all communication of another type and/or greater than a certain size.
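
One possible way to express such a mixed policy is sketched below in Python. The 64-kilobyte threshold, the message type label "control", and the function `choose_transport` are hypothetical values and names chosen for illustration. The description leaves the actual types and sizes open.

```python
SIZE_THRESHOLD_BYTES = 64 * 1024         # assumed cut-over size
TCP_ONLY_TYPES = frozenset({"control"})  # assumed types always sent over TCP/IP


def choose_transport(message_type: str, size_bytes: int,
                     low_overhead_negotiated: bool) -> str:
    """Pick a transport for one message under the assumed policy."""
    if not low_overhead_negotiated:
        return "tcp"           # handshake did not enable low-overhead transfer
    if message_type in TCP_ONLY_TYPES:
        return "tcp"           # communication of a certain type stays on TCP/IP
    if size_bytes < SIZE_THRESHOLD_BYTES:
        return "tcp"           # small messages stay on TCP/IP
    return "low-overhead"      # large messages of other types


print(choose_transport("bulk", 4 * 1024 * 1024, True))  # prints low-overhead
```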

FIG. 9 shows an example of low-overhead data transfer in accordance with one or more embodiments of the invention. FIG. 9 is provided for exemplary purposes only and should not be construed as limiting the scope of the invention. Referring to FIG. 9, blade A (900) and blade B (902) are each communicatively coupled with a chassis interconnect (912). Application A (908) in blade A (900) is configured to communicate with application B (910) in blade B (902) via a TCP connection having socket A (918) and socket B (920) as endpoints. Specifically, socket A (918) is configured to transfer data to socket B (920) by way of VNIC A (926), VNIC B (928), and the chassis interconnect (912). Further, application A (908) is executing in virtual machine A (904) on guest OS A (not shown) and application B (910) is executing in virtual machine B (906) on guest OS B (not shown).

Based on the above, consider the scenario in which application A (908) and application B (910) each have performed the method described in FIG. 7 to generate pre-post buffer information. More specifically, application A (908) allocated pre-post buffer A (not shown) in Application A virtual memory (VM) (914). The virtual memory address associated with pre-post buffer A is then translated to a guest operating system VM (922) address. The guest operating system VM (922) address is then translated by the host operating system A (930) to obtain a host VM address from the host VM (934), which corresponds to an underlying physical memory address. A similar process is performed for Application B (910), using Application B VM (916) and translating to a guest operating system VM (924) address and finally to an underlying physical memory address, which corresponds to a host VM address in host VM (936).

Using the above pre-post buffer information, the applications may communicate as follows in accordance with one embodiment of the invention. Specifically, application A (908) is configured to request a TCP connection with application B (910) for transferring data. Socket A (918) initiates a TCP connection with socket B (920) via VNIC A (926) to VNIC B (928).

Once the TCP connection is established, the zero-copy handshake is performed. Specifically, a determination is made by VNIC A (926) that Application A (908) and Application B (910) are connected over a local TCP connection. A further determination is made that Application B (910) will send data to Application A (908) using low-overhead data transfer and Application B (910) will receive data from Application A (908) using low-overhead data transfer.

In one or more embodiments of the invention, VNIC A (926) then passes Application A's pre-post buffer information to VNIC B (928) and VNIC B (928) passes Application B's pre-post buffer information to VNIC A (926). The applications may then transfer data using low-overhead data transfer.

In one embodiment of the invention, data from application B (910) is transferred using an RDMA engine and application A's pre-post buffer information directly to application A's VM (914), where the RDMA engine is located on blade B (902) and is managed by VNIC B (928). Prior to the transfer, VNIC A may compare the location in the physical memory received from VNIC B with an allowed address range associated with application A to determine whether the data may be transferred to the location in memory specified by the pre-post buffer information. If the location in physical memory received by VNIC A is outside the allowed address range, then the transfer may be denied.
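
The range comparison described above might, for example, be sketched as follows in Python. The function `transfer_allowed`, the half-open range convention, and the inclusion of the transfer length in the check are assumptions made for this sketch. The description states only that the location is compared with an allowed address range.

```python
def transfer_allowed(address: int, length: int,
                     allowed_start: int, allowed_end: int) -> bool:
    """Return True if [address, address + length) lies inside the allowed range.

    The allowed range is treated as half-open: [allowed_start, allowed_end).
    """
    if length < 0:
        return False
    return allowed_start <= address and address + length <= allowed_end


# Illustrative allowed range for application A: 0x40000 up to 0x50000.
print(transfer_allowed(0x40123, 0x1000, 0x40000, 0x50000))  # prints True
print(transfer_allowed(0x4F800, 0x1000, 0x40000, 0x50000))  # prints False
```

Checking the end of the region as well as its start is a design choice of the sketch, intended to reject a write that begins inside the allowed range but would run past its end.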

Embodiments of the invention may also be used to transfer data between applications by transferring data between virtual machines (e.g., virtual machine A (904) and virtual machine B (906)). For example, referring to FIG. 9, to send data from application A (908) to application B (910), application A (908) may transfer data over the connection to VNIC A (926). VNIC A (926), in accordance with embodiments of the invention, obtains the pre-post buffer information for virtual machine B (906) and subsequently transfers the data using, for example, an RDMA engine directly to the guest OS B VM (924). Upon receipt, the data is copied into application B VM (916). In such cases, the virtual machines, as opposed to the applications, are aware of the ability to transfer data using low-overhead data transfer. However, the applications are not aware of this functionality. Further, the applications, in this scenario, do not need to include functionality to pre-post buffers. Instead, the virtual machines need to include functionality to pre-post buffers.

Those skilled in the art will appreciate that while the invention has been described with respect to using blades, the invention may be extended for use with other computer systems, which are not blades. Specifically, the invention may be extended to any computer, which includes at least memory, a processor, and a mechanism to physically connect to and communicate over the chassis interconnect. Examples of such computers include, but are not limited to, multi-processor servers, network appliances, and light-weight computing devices (e.g., computers that only include memory, a processor, a mechanism to physically connect to and communicate over the chassis interconnect, and the necessary hardware to enable the aforementioned components to interact).

Further, those skilled in the art will appreciate that if one or more computers that are not blades are used to implement the invention, then an appropriate chassis may be used in place of the blade chassis.

Software instructions to perform embodiments of the invention may be stored on a computer readable medium such as a compact disc (CD), a diskette, a tape, or any other computer readable storage device.

While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this disclosure, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as disclosed herein. Accordingly, the scope of the invention should be limited only by the attached claims.

What is claimed is:
 1. A method for low-overhead data transfer, comprising: initiating, by a first application, a Transmission Communication Protocol (TCP) connection with a second application, wherein the first application is executing on a first computer in a first virtual machine, the second application is executing on a second computer in a second virtual machine, and the first computer and the second computer are located on a chassis and communicate over a chassis interconnect, wherein the first computer is directly connected to the chassis interconnect at a first Peripheral Component Interface Express (PCI-E) end point and the second computer is directly connected to the chassis interconnect at a second PCI-E endpoint; establishing, in response to the initiation, the TCP connection between the first application and the second application; selecting, by a first Virtual Network Interface Card (VNIC), a second protocol from a group consisting of the first protocol and the second protocol based on a determination that the TCP connection between the first application and the second application is a local TCP connection, wherein the first VNIC is located on the first computer and is interposed between the first virtual machine and the chassis interconnect, wherein the TCP connection is the local TCP connection when the first application and the second application are executing on separate physical computers connected to the chassis, and wherein the first protocol is TCP and the second protocol comprises using a low-overhead data transfer; based on the selection of the second protocol: providing, by the first application, pre-post buffer information to the second application, wherein the pre-post buffer information corresponds to a location in a physical memory of the first computer and wherein the location in physical memory corresponds to a virtual memory address of the first application; and transferring data, by the second application, to the first application using the pre-post buffer information, wherein transferring the data comprises writing the data directly into the location in the physical memory of the first computer.
 2. The method of claim 1, further comprising: generating the pre-post information, wherein generating the pre-post information comprises: allocating the virtual memory address in virtual memory associated with the first application; providing the virtual memory address to a guest operating system (OS) executing the first application, wherein the guest OS is executing in the first virtual machine; translating the virtual memory address to obtain a guest OS virtual memory address associated with the guest operating system; providing the guest OS virtual memory address to a host operating system upon which the guest operating system is executing; translating the virtual memory address to obtain a host OS virtual memory address associated with the host operating system, wherein the host OS virtual memory address corresponds to the location in the physical memory of the first computer.
 3. The method of claim 1, wherein the pre-post information is provided to the over the TCP connection and wherein the pre-post information is provided to the first VNIC.
 4. The method of claim 3, wherein the first VNIC is configured to compare the location in the physical memory received from a second VNIC with an allowed address range associated with the first application to determine whether the data may be transferred to the location in the physical memory, wherein the second VNIC is located on the second computer.
 5. The method of claim 1, wherein the second application provides a second virtual network interface card (VNIC) located on the second computer with a location of physical memory associated with the TCP connection.
 6. The method of claim 5, wherein transferring the data comprises: writing, by the second VNIC, the data to the location in the physical memory of the first computer using remote direct memory access (RDMA) and the location in the physical memory of the first computer.
 7. The method of claim 5, wherein the first VNIC and the second VNIC are nodes in a virtual network path, wherein the virtual network path comprises a first virtual wire between the first VNIC and the second VNIC.
 8. The method of claim 1, wherein the first computer and the second computer are blades.
 9. A system comprising: a chassis interconnect; and a first application is executing on a first computer in a first virtual machine and a second application is executing on a second computer in a second virtual machine, wherein the first computer and the second computer are located on a chassis and communicate over the chassis interconnect, wherein the first computer is directly connected to the chassis interconnect at a first Peripheral Component Interface Express (PCI-E) end point and the second computer is directly connected to the chassis interconnect at a second PCI-E endpoint, wherein the first application is configured to initiate a Transmission Communication Protocol (TCP) connection with the second application, wherein, in response to the initiation, the TCP connection is established between the first application and the second application, wherein a first virtual network interface card, executing on the first computer and interposed between the first virtual machine and the chassis interconnect, is configured to select a second protocol from a group consisting of the first protocol and the second protocol based on a determination that the TCP connection between the first application and the second application is a local TCP connection, wherein the TCP connection is the local TCP connection when the first application and the second application are executing on separate physical computers connected to the chassis, and wherein the first protocol is TCP and the second protocol comprises using a low-overhead data transfer, based on the selection of the second protocol, the first application is configured to provide pre-post buffer information to the second application after the first application is determined to be executing on the same chassis as the second application, wherein the pre-post buffer information corresponds to a location in a physical memory of the first computer and wherein the location in physical memory corresponds to a virtual memory address of the first application, and wherein the second application transfers data to the first application using the pre-post buffer information, wherein transferring the data comprises writing the data directly into the location in the physical memory of the first computer.
 10. The system of claim 9, wherein the pre-post information is generated by: allocating the virtual memory address in virtual memory associated with the first application; providing the virtual memory address to a guest operating system (OS) executing the first application, wherein the guest OS is executing in the first virtual machine; translating the virtual memory address to obtain a guest OS virtual memory address associated with the guest operating system; providing the guest OS virtual memory address to a host operating system upon which the guest operating system is executing; translating the virtual memory address to obtain a host OS virtual memory address associated with the host operating system, wherein the host OS virtual memory address corresponds to the location in the physical memory of the first computer.
 11. The system of claim 9, wherein the pre-post information is provided to the over the TCP connection.
 12. The system of claim 9, wherein the second application provides a second VNIC located on the second computer with a location of physical memory associated with the TCP connection.
 13. The system of claim 12, wherein transferring the data comprises: writing, by the second VNIC, the data to the location in the physical memory of the first computer using remote direct memory access (RDMA) and the location in the physical memory of the first computer.
 14. The system of claim 12, wherein second virtual machine is configured to directly transfer data from the first virtual machine to a location in the physical memory of the first computer, wherein the second VNIC transfers the data using a remote direct memory access (RDMA) engine.
 15. The system of claim 9, wherein the first computer and the second computer are blades.
 16. A non-transitory computer readable medium comprising a plurality of executable instructions for low-overhead data transfer, wherein the plurality of executable instructions comprises instructions to: initiate, by a first application, a Transmission Communication Protocol (TCP) connection with a second application, wherein the first application is executing on a first computer in a first virtual machine, the second application is executing on a second computer in a second virtual machine, and the first computer and the second computer are located on a chassis and communicate over a chassis interconnect, wherein the first computer is directly connected to the chassis interconnect at a first Peripheral Component Interface Express (PCI-E) end point and the second computer is directly connected to the chassis interconnect at a second PCI-E endpoint; establish, in response to the initiation, the TCP connection between the first application and the second application; select, by a first Virtual Network Interface Card (VNIC), a second protocol from a group consisting of the first protocol and the second protocol based on a determination that the TCP connection between the first application and the second application is a local TCP connection, wherein the first VNIC is located on the first computer and is interposed between the first virtual machine and the chassis interconnect, wherein the TCP connection is the local TCP connection when the first application and the second application are executing on separate physical computers connected to the chassis, and wherein the first protocol is TCP and the second protocol comprises using a low-overhead data transfer; based on the selection of the second protocol: provide, by the first application, pre-post buffer information to the second application, wherein the pre-post buffer information corresponds to a location in a physical memory of the first computer and wherein the location in physical memory corresponds to a virtual memory address of the first application; and transfer data, by the second application, to the first application using the pre-post buffer information, wherein transferring the data comprises writing the data directly into the location in the physical memory of the first computer.
 17. The non-transitory computer readable medium of claim 16, wherein the plurality of executable instructions further comprises instructions to: generate the pre-post information, wherein generating the pre-post information comprises: allocating the virtual memory address in virtual memory associated with the first application; providing the virtual memory address to a guest operating system (OS) executing the first application, wherein the guest OS is executing in the first virtual machine; translating the virtual memory address to obtain a guest OS virtual memory address associated with the guest operating system; providing the guest OS virtual memory address to a host operating system upon which the guest operating system is executing; translating the virtual memory address to obtain a host OS virtual memory address associated with the host operating system, wherein the host OS virtual memory address corresponds to the location in the physical memory of the first computer.
 18. The non-transitory computer readable medium of claim 16, wherein the TCP connection is established between the first VNIC located on the first computer and a second VNIC located on the second computer, wherein the first computer and the second computer are blades.
 19. The non-transitory computer readable medium of claim 18, writing, by the second VNIC, the data to the location in the physical memory of the first computer using a remote direct memory access (RDMA) engine.